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Abstract 

This  paper  describes  a  Nonhydrostatic  Unified  Model  of  the  Atmosphere 
(NUMA)  based  on  a  spectral  element,  or  high-order  continuous  Galerkin 
(CG)  spatial  discretization  utilizing  3D  hexahedral  elements.  The  nonhydro- 
static  dynamical  core,  based  on  the  compressible  Euler  equations,  is  appropri¬ 
ate  for  both  limited-area  and  global  atmospheric  simulations.  In  this  paper, 
we  restrict  our  attention  to  3D  limited-area  phenomena;  global  atmospheric 
simulations  will  be  presented  in  a  follow-up  paper.  A  suite  of  explicit  and 
semi-implicit  time-integrators  is  presented.  Domain  decomposition  and  com¬ 
munication  algorithms  utilized  by  our  distributed  memory  implementation  is 
presented,  allowing  efficient  evaluation  of  the  the  direct  stiffness  summation 
(DSS)  operator.  Numerical  verification  of  the  model  is  performed  using  four 
test  cases:  1)  2D  inertia-gravity  waves,  2)  flow  past  a  3D  linear  hydrostatic 
mountain,  3)  flow  past  a  3D  nonlinear  mountain  and  4)  3D  buoyant  convec¬ 
tion  of  a  bubble  in  a  neutral  atmosphere;  these  tests  indicate  that  NUMA 
can  simulate  the  necessary  physics  of  a  dry  numerical  weather  prediction  dy¬ 
namical  core.  Scalability  for  the  explicit  dynamical  core  is  demonstrated  for 
12288  cores  on  TACC’s  Ranger  cluster,  while  the  semi-implicit  core  is  shown 
to  scale  to  4096  cores  on  the  same  architecture. 

Keywords:  compressible  flow,  Euler,  Lagrange,  Legendre,  Navier-Stokes, 
nonhydrost  at  ic ,  par  alleliz at  ion . 


1.  Introduction 

As  the  resolution  of  numerical  weather  prediction  (NWP)  models  increase, 
nonhydrostatic  effects  become  relevant.  Nonhydrostatic  dynamical  cores  al¬ 
low  grid  resolutions  ranging  from  a  few  hundred  kilometers  (mesoscale)  to 
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several  kilometers  (microscale)  to  approximately  100  meters  (large  eddy  sim¬ 
ulations).  Almost  all  current  limited-area  models  utilize  a  nonhydrostatic 
core,  while  global  models  are  currently  transitioning  from  the  hydrostatic 
to  the  nonhydrostatic  regime.  A  host  of  challenging,  non-trivial  numerical 
problems  arise  when  one  enters  the  nonhydrostatic  regime:  these  include  1) 
choosing  the  appropriate  equation  set  to  ensure  efficiency,  accuracy,  and  con¬ 
servation,  2)  effectively  resolving  multi-scale  flow  features,  that  may  require 
adaptive  mesh  refinement  (AMR),  3)  developing  efficient  time-integrators 
and/or  “soundproof’  equation  sets  to  confront  the  fast  acoustic  and  gravity 
waves  present  in  non- hydrostatic  models,  and  4)  developing  scalable  parallel 
codes  for  shared,  distributed,  and  hybrid  architectures  based  on  1)  -3). 

Virtually  all  current  nonhydrostatic  NWP  models  are  based  on  a  combi¬ 
nation  of  finite-difference  discretization  in  space  and  either  split-explicit  or 
semi-implicit  discretization  in  time.  Examples  include  WRF  [24]  (NCAR), 
Lokal  Modell  [30]  (DWD),  COAMPS  [18]  (US  Navy),  and  UM  [2]  (UK  Met 
Office).  Although  finite  difference  schemes  are  very  efficient,  they  suffer  from 
several  problems,  including:  1)  dispersion  error,  2)  geometrical  inflexibility, 
and  3)  lacking  of  scalability  to  large  (e.g.  tens  of  thousands)  of  processors. 
To  overcome  some  of  the  limitations  of  finite-difference  approximations,  sev¬ 
eral  emerging  NWP  models  utilize  finite-volume  spatial  discretization,  such 
as  MPAS  [38]  and  MOORE  [40],  which  utilize  polynomial  reconstruction 
of  the  inter-element  fluxes.  To  overcome  these  limitations,  element-based 
Galerkin  (EBG)  methods  have  recently  been  proposed  for  several  next  gen¬ 
eration  NWP  models  [12,  14,  28]  and  [5]  using  both  the  continuous  Galerkin 
(CG)  and  discontinuous  Galerkin  (DG)  formulations. 

EBG  methods  possess  several  desirable  attributes,  such  as  1)  higher-order 
accuracy,  2)  geometrical  flexibility,  whereby  the  solver  is  completely  indepen¬ 
dent  of  the  grid,  3)  excellent  dispersion  properties  [26]  and  4)  minimal  com¬ 
munication  overhead  within  a  parallel  implementation.  High-order  accuracy, 
which  allows  fine-scale  atmospheric  flow  features  to  be  resolved,  is  achieved 
by  representing  the  prognostic  variables  via  an  orthogonal  basis  function  ex¬ 
pansion  within  each  element.  Geometrical  flexibility,  which  is  inherent  to  all 
EBGs  (both  low  and  high  order),  is  advantageous  since  any  terrain-following 
coordinate  may  be  utilized  within  the  same  solver;  also,  both  static  and 
dynamic  adaptivity  may  be  retro-fitted  to  the  existing  solver.  Finally,  in  a 
distributed  memory  environment  (e.g.  cluster),  low  communication  overhead 
is  critical  to  maintain  linear  scalability  to  hundreds  of  thousands  of  proces¬ 
sor  cores.  In  our  previous  work,  higher-order  accuracy  and  geometrical  flex- 
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ibility  were  addressed  within  an  explicit  framework  [12]  and  semi-implicit 
framework  [14]  for  2D  (x-z  slices)  problems  using  a  serial  implementation. 
However,  parallel  implementation  was  not  explicitly  addressed.  The  purpose 
of  the  present  work  is  to  extend  the  work  begun  in  [12]  and  [14]  to  realis¬ 
tic  3D  domains  using  a  parallel,  MPI-based  implementation.  Although  we 
restrict  our  attention  to  limited-area  (mesoscale  and  large-eddy  scale)  simu¬ 
lations,  the  dynamical  core  developed  in  this  paper  may  also  be  applied  to 
global  atmospheric  simulations.  Hence,  we  are  considering  a  Nonhydrostatic 
Unified  Model  of  the  Atmosphere  (NUMA)  that  is  being  developed  in  con¬ 
junction  with  the  Naval  Research  Laboratory  (NRL)  in  Monterey,  CA;  to  our 
knowledge,  this  is  the  first  3D  spectral  element  model  for  a  nonhydrostatic 
atmosphere. 

This  work  is  guided  by  a  need  for  highly  scalable  models  in  distributed 
memory  environments.  In  the  past  five  years,  clock  speeds  of  processors 
have  remained  stagnant;  to  achieve  increased  floating-point  performance, 
chip  manufacturers  have  developed  multiple  core  processors  that  allow  many 
threads  to  execute  in  parallel.  In  tandem,  high  performance  computing 
(HPC)  has  evolved  towards  clusters  with  processors  counts  exceeding  100,000. 
As  we  approach  the  exaflop  era,  core  counts  are  expected  to  approach  1,000,000 
Therefore,  next-generation  NWP  models  must  be  based  upon  scalable  nu¬ 
merical  methods  that  allow  arbitrarily  large  processor  counts  with  minimal 
communication  overhead.  EBG  methods,  both  CG  and  DG,  have  proved 
effective  in  this  respect  in  modeling  massive  biological  flow  [15],  shallow  wa¬ 
ter  flows  [7],  incompressible  flows  using  low-order  finite  elements  [19],  and 
geodynamical  problems  [41]. 

This  paper  presents  a  scalable,  3D  nonhydrostatic  spectral  element  at¬ 
mospheric  model  targeted  toward  distributed  memory  architectures.  We 
are  developing  a  unified  dynamical  core  appropriate  for  both  mesoscale  and 
global  simulations.  The  current  implementation  utilizes  tensor  products  of 
Lagrange  polynomials  in  a  hexahedral  grid  for  maximum  computational  effi¬ 
ciency;  however,  more  flexible  grids  based  on  either  tetrahedra  or  triangular 
prisms  may  be  incorporated  into  future  versions  with  relative  ease.  The 
remainder  of  this  paper  is  structured  as  follows.  In  Section  2,  we  formu¬ 
late  the  nonhydrostatic  compressible  Euler  equations  (Set  2NC  from  [14]), 
which  constitute  the  governing  equations  of  our  dynamical  core.  To  ensure 
numerical  stability,  these  nonhydrostatic  equations  are  solved  about  a  hydro¬ 
static  base  state.  In  Section  3,  we  present  the  spectral  element  discretization, 
along  with  explicit  and  semi-implicit  time-integration  schemes  and  the  nec- 
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essary  boundary  conditions.  Section  4,  which  forms  the  core  of  the  paper, 
outlines  the  parallelization  algorithm  for  both  the  explicit  and  semi-implicit 
time-integrators.  Numerical  results  for  a  2D  inertia-gravity  wave,  3D  linear 
hydrostatic  and  non-linear  mountains,  and  a  3D  rising  thermal  bubble  are 
shown  in  Section  5  along  with  the  results  of  scalability  experiments.  Scala¬ 
bility  for  the  explicit  code  is  demonstrated  to  12288  processor  cores  for  the 
explicit  code  and  4096  processor  cores  for  the  semi-implicit  code. 


2.  Governing  Equations 

We  consider  the  fully  compressible,  nonhydrostatic  Euler  equations  in 
non-conservative  form.  These  equations  (Set  2NC),  which  are  valid  for  spatial 
resolutions  finer  than  10  km,  have  previously  been  considered  within  a  semi- 
implicit  framework  in  [14]  for  2D,  limited-area  atmospheric  flows.  Of  the  five 
equation  sets  considered  in  [14],  set  2NC  proved  to  be  both  computationally 
efficient  and  provides  acceptable  mass  and  energy  conservation  properties;  in 
addition,  replacing  the  advection  operator  in  the  momentum  equation  allows 
for  formal  conservation  of  energy  up  to  time  truncation  error.  Both  diabatic 
forcing  and  the  effects  of  moisture  are  neglected;  in  other  words,  we  consider  a 
dry  dynamical  core  without  sub-grid  scale  turbulence  closure.  In  the  present 
study,  we  consider  three-dimensional  flow  in  Cartesian  coordinates  ( x-y-z ) 
subject  to  gravitational  and  Coriolis  forces,  yielding 

^  +  V  •  (pu)  =  0  (la) 


(9l_l  1 

— ^  +  u  ■  Vu  -| — VP  +  gk  +  f  xu  =  0 
at  p 


u  •  V0  =  0  . 


(lb) 

(lc) 


where  the  prognostic  variables  are  (p,  ur,  6),  where  p  is  density,  u  =  (w,  v ,  w)T 
is  velocity,  and  6  is  potential  temperature.  In  addition,  P  is  pressure,  g  is  the 
gravitational  constant,  f  =  2Dk  is  a  Coriolis  parameter  (with  D  the  angular 
frequency  of  the  earth),  and  k  is  the  unit  vector  in  the  z  direction.  Eq.  (la) 
enforces  mass  conservation,  Eq.  (lb)  enforces  conservation  of  momentum, 
and  Eq.  (lc)  enforces  conservation  of  entropy.  To  close  the  system  of  conser¬ 
vation  laws  given  by  Eq.  (1),  a  thermodynamic  equation  of  state  is  required. 
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We  utilize  the  ideal  gas  law  given  by 

p^(f)7 

where  Pa  is  the  atmospheric  pressure  at  ground,  R  is  the  ideal  gas  constant, 
and  7  ~  1.4  is  the  ratio  of  specihc  heats.  In  three  dimensions,  Eqs.  (1)  and 
(2)  constitute  a  closed  system  of  nonlinear  PDEs  in  Eve  unknowns. 

To  facilitate  the  solution  of  the  compressible  Euler  equations  and  maintain 
numerical  stability,  we  split  the  density,  pressure,  and  potential  temperatures 
about  their  mean  hydrostatic  values: 


p(x,  z,  y,  t )  =  p0(z)  +  p'(x,  y,  z,  t) 

9(x,  y,  z,  t )  =  90(z)  +  &{x,  y,  z,  t ) 
P(x,  y,  z,  t )  =  PQ{z)  +  P'(x,  y,  z,  t ) 


(3a) 

(3b) 

(3c) 


where  p0 ,  90,  and  P0  are  the  hydrostatic  reference  states  that  depend  only 
on  the  vertical  distance  z.  Inserting  Eq.  (3)  into  Eq.  (1)  and  applying 
hydrostatic  balance 

dPo 

=  - Po9  (4) 


dz 


yields  the  system 


w+u'Vp'+rotr  +  (p,+',o)v'u=0 


(5a) 


du 

~dt 


u  ■  Vu 


p'  +  Po 


VP'  + 


P 


P'  +  Po 


cyk  +  f  x  u  =  0 


30  d90 

—  +  u  •  VP  +  w—  =  0. 
at  dz 


(5b) 

(5c) 


Defining  a  solution  vector  q  =  (//,  ur,  9')r ,  Eq.  (5)  is  written  in  condensed 
form  as 

w  =  S(q)  (6) 

where  the  source  term  S'(q)  is  a  nonlinear,  first-order  differential  operator. 
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3.  Numerical  Methods 

In  this  section,  we  briefly  discuss  the  numerical  discretization  used  by 
NUMA:  1)  spatial  discretization  of  Eq.  (5)  using  a  continuous  Galerkin  (CG), 
or  spectral  element  method,  2)  a  suite  of  explicit  and  semi-implicit  time- 
integrators,  and  3)  the  necessary  boundary  conditions  required  for  limited- 
area  problems. 

3.1.  Spatial  Discretization 

The  spectral  element  method  decomposes  the  spatial  domain  12  C  1Z 3 
into  Ne  disjoint  elements  12e  via 

Ne 

12  =  |J  ne.  (7) 

e=l 

In  the  current  formulation  of  NUMA,  we  let  12e  be  hexahedra,  which  provides 
1)  simple  grid  generation  and  2)  efficient  (fast)  evaluation  of  the  necessary 
differentiation  and  integration  operators.  We  note,  however,  that  12e  may  be 
replaced  by  either  1)  tetrahedra  or  2)  triangular  prisms  in  a  future  version 
of  NUMA. 

Letting  the  unit  cube  (£,  rj,  ()  E  E  =  [—1,  l]3  be  the  reference  hexahedra! 
element,  a  transformation  Te  :  12e  — >  E  mapping  physical  space  to  com¬ 
putational  space  is  defined  for  each  element,  yielding  (. x,y,z )  =  J-e(£,  rjX)- 
Associated  with  Te  is  a  Jacobian  Je,  that,  in  conjunction  with  the  decompo¬ 
sition  in  Eq.  (7)  allows  globally  defined  integrals.  We  express 

,  Ne 

/  /(x)d!2  =  /\  |Je|/(e)  (xe)  <212e  (8) 

Jn  e=i 

where  A^i  denotes  the  global  assembly,  or  direct  stiffness  summation  (DSS) 
operator  and  |  Je\  is  the  determinant  of  the  Jacobian  mapping.  Eq.  (8)  forms 
the  basis  for  all  CG  algorithms;  moreover,  Eq.  (8)  is  responsible  for  the 
local  nature  of  CG  that  facilitates  parallelization.  This  aspect  of  CG  will  be 
explored  in  detail  in  Section  4. 

Within  each  element  Qe,  a  finite-dimensional  approximation  q^v  is  formed 
by  expanding  q(x,  t)  in  basis  functions  tpj  (x)  such  that 

mn 

qiv(x,  t)  (9) 

3= 1 
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where  M ^  =  (iV+1)3  is  the  number  of  nodes  per  element  and  N  is  the  order  of 
the  basis  functions.  The  discrete  solution  qN  is  assumed  C°(D)  continuous 
across  inter-element  boundaries.  For  basis  functions,  we  construct  tensor 
products  of  Lagrange  polynomials  given  by 


tpi  (x)  =  ha( f)  <g>  hp(rj)  <g>  hf( C)  (10) 

where  ha(£)  is  the  Lagrange  polynomial  associated  with  the  Legendre- Gauss- 
Lobatto  (LGL)  points  G  and  ()  are  functions  of  the  physical  variable 
x.  These  LGL  points  satisfy 

(i-e2A(o  =  o  (ii) 


where  Pn(£)  is  the  iV-th  order  Legendre  polynomial.  Approximating  the 
prognostic  vector  q(x,  t)  by  a  finite-dimensional  approximation  in  Eq. 
(9),  multiplying  by  a  basis  function  Gi  and  integrating  over  the  domain 
yields  the  weak  form 

[  Vtf-xy-  dQ  =  f  faS  (qw)  dtt  (12) 

Jn  ut  Jn 

where  we  have  replaced  the  index  i  with  /  to  emphasize  that  Eq.  (12)  is  now 
a  global  representation.  Applying  the  Galerkin  expansion  given  by  Eq.  (9) 
to  Eq.  (12)  yields  the  matrix-vector  equation 

(13) 

where  the  mass-matrix  M/j  =  fQ  dfl  is  diagonal  if  the  interpolation 
and  integration  points  are  co-located.  This  approximation  is  valid  for  iV  >  4 
while  incurring  a  small  error  in  integration  [9].  Denoting  the  right-hand  side 
(RHS)  of  Eq.  (13)  by  Ri(qi ),  Eq.  (13)  is  expressed  as 


e=l 


Note  that  the  DSS  operator  maps  local,  element-wise  coordinates  i  to  global 
coordinates  I.  Eq.  (14)  forms  the  core  of  the  spectral  element  method,  al¬ 
lowing  local,  element-wise  information  to  propagate  to  adjacent  elements 
via  the  DSS  operator.  In  Section  4,  we  discuss  the  efficient  evaluation  of 
Eq.  (14)  within  a  distributed  memory  architecture.  Moreover,  the  choice 
of  tensor  product  basis  functions,  coupled  with  inexact  integration,  allows 
the  right-hand  sides  5(q)  to  be  evaluated  with  0(NeN4)  work  [6],  thereby 
resulting  in  an  efficient  algorithm  [6]. 
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3.2.  Temporal  Discretization 

In  the  following  section,  we  discuss  two  schemes  used  to  evolve  Eq.  (6) 
forward  in  time:  an  explicit  Runge-Kutta  RK-35  method  and  a  semi-implicit 
BDF2  method.  A  fully- implicit,  Jacobian-Free  Newton  Krylov  (JFNK)  time- 
integrator  [25]  and  a  family  of  semi-implicit  RK  methods  were  also  imple¬ 
mented;  however,  we  defer  discussing  these  results  and  will  report  their  rel¬ 
ative  strengths  and  weaknesses  in  future  work. 

3.2.1.  Explicit  RK  methods 

We  implemented  a  strong  stability  preserving  (SSP)  Runge-Kutta  third- 
order,  five  stage  time- integrator  proposed  by  [36].  This  time- integrator  is 
stable  for  Courant  numbers  of  1.3  or  less. 

3.2.2.  Semi- Implicit  BDF2  (Schur  Form) 

Semi-implicit,  or  IMEX  schemes,  decompose  the  operator  S'(q)  in  Eq.  (6) 
into  two  components:  a  linear  term  and  and  a  nonlinear  term.  This  decom¬ 
position  exploits  the  underlying  physics  of  the  compressible  Euler  equations: 
namely,  that  the  fast  moving  acoustic  and  gravity  waves  are  linear,  while 
the  slower  advective  waves  are  nonlinear.  Hence,  the  fast  moving  waves  may 
be  discretized  implicitly,  while  the  slower  advective  waves  are  discretized  ex¬ 
plicitly.  Hence,  the  semi-implicit  schemes  allow  much  larger  time-steps  than 
explicit  schemes,  since  the  Courant  number  is  only  restricted  by  the  advec¬ 
tive  dynamics.  Semi-implicit  discretizations  of  spectral-element  atmospheric 
models  was  recently  developed  in  [14];  a  brief  outline  is  provided  in  this 
section  for  completeness. 

First,  Eq.  (6)  is  rewritten  as  follows: 

^  =  [^(q)  -<S£(q)]  +  ^(q)  (15) 

where  L(q)  is  the  linearized  Euler  operator  specified  in  Appendix  A  of  [14] 
and  5  =  0,  1  is  a  switch.  The  operator  L(q)  is  responsible  for  1)  acoustic, 
2)  barotropic  gravity  and  3)  baroclinic  gravity  waves.  By  setting  6  =  0,  we 
recover  an  explicit  scheme,  while  for  6  =  1,  we  have  the  semi-implicit  scheme. 
To  discretize  Eq.  (15)  in  time,  a  general  backward  difference  (BDF)  formula 
is  utilized,  yielding 

K- 1  K—l 

qn+1  =  akqn~k+^At  Pk  [S(qn_fc)  -  SL( qn“fc)]  +^At5L(qn+1)  (16) 

k= 0  k= 0 
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where  1  <  K  <  6  is  the  order  of  the  time-integrator.  Eq.  (16)  is  rewritten 
in  terms  of  the  variables  defined  in  Section  3  of  [14]  yielding 

qtt  +  AL(qtt)  =  q  (17) 

where  A  =  <TyA t.  Eq.  (17)  is  immediately  realized  as  a  system  of  linear 
equations  Ax  =  b,  where  the  left-hand  side  (LHS)  matrix  A  is  given  by 
A  =  I  +  XL  and  right-hand  side  (RHS)  vector  q.  Examining  Eq.  (17)  in  con¬ 
junction  with  [14],  the  semi-implicit  problem  consists  of  the  following  three 
steps:  1)  solve  the  explicit  problem  and  form  q,  2)  solve  the  linear  system 
given  by  Eq.  (17)  for  the  intermediate  variable  q#,  and  3)  backsubstitute  to 
determine  the  update  qn+1. 

The  majority  of  the  computational  work  is  spent  in  solving  Eq.  (17)  via 
an  iterative  Krylov  subspace  technique,  such  as  GMRES  [29].  Noting  that 
this  linear  system  is  size  5 Ng  x  5 Ng  for  the  3D  Euler  equations,  the  cost 
of  an  iterative  solve  is  0(25k2N2),  where  Ng  is  the  number  of  global  grid 
points  and  k  is  the  number  of  Krylov  iterations.  Hence,  the  cost  of  solving 
Eq.  (17)  may  become  prohibitive  even  for  relatively  small  3D  problems.  To 
mitigate  this  problem,  the  size  of  the  linear  system  may  be  reduced  from  5 Ng 
to  Ng  via  the  Schur  complement  approach,  where  the  first-order  system  of 
5 Ng  equations  is  reduced  to  a  second-order  system  of  Ng  equations  written 
in  terms  of  the  linearized  discrete  pressure  Pu ■  Symbolically,  we  solve 

NPtt  =  Rtt.  (18) 

where  H  =  H  (D,  Dr)  is  a  linear,  pseudo-Helmholtz  operator  consisting 
of  gradient  D  and  divergence  Dr  operators  operating  on  the  discretized 
pressure  Pu  and  Rtt  is  an  effective  source  term.  Explicit  expressions  for  7i 
and  Ru  are  given  by  Eq.  (A. 10)  in  [14].  We  note  that  the  gradient  operator 
may  be  written  as  a  DSS  of  local  gradient  operators  using  the  notation 

Ne 

D  =  /\  D(e)  (19) 

e=l 

which  is  crucial  for  the  construction  of  a  parallel,  semi-implicit  algorithm 
(see  Section  4). 

3.3.  Boundary  Conditions 

Limited-area  atmospheric  models,  such  as  mesoscale  codes,  typically  re¬ 
quire  two  types  of  boundary  conditions:  1)  no- flux  boundary  conditions 
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(NFBCs),  that  mimic  impenetrable  objects  (e.g.,  the  ground)  and  2)  non¬ 
reflecting  boundary  conditions  (NRBCs),  that  mimic  an  infinite  domain  (e.g. 
the  top  of  the  atmosphere)  by  allowing  waves  to  propagate  out  of  the  com¬ 
putational  domain  without  generating  spurious  reflections.  In  this  section, 
we  outline  how  NFBCs  are  imposed  via  projection  matrices  and  how  NRBCs 
are  imposed  via  sponges.  For  additional  details,  see  [12]. 

3.3.1.  No- Flux  Boundary  Conditions 

All  of  our  test  cases  utilize  NFBCs  on  the  bottom  boundary,  while  some 
test  cases  (such  as  the  rising  thermal  bubble)  utilize  NFBCs  on  other  bound¬ 
aries  as  well.  For  NFBCs,  we  enforce 

n  •  u  =  0  (20) 

for  all  points  on  the  boundary  T,  where  n  is  the  outward  pointing  unit  normal 
on  T.  In  order  to  apply  the  NFBC  to  the  prognostic  vector  q,  we  augment 
n  to  R5  via  n  =  (0,  nr,  0 )r,  yielding  n  •  q  =  0.  To  apply  theses  boundary 
conditions  in  the  strong  sense,  we  construct  a  3  by  3  projection  matrix  P  via 

(1  -  n2x  ~nxny  -nxnz  \ 

-nynx  1  -  n2y  -nynz  .  (21) 

~nznx  -nzny  1  -  n2z  ) 

This  matrix  is  constructed  during  the  initialization  phase  and  applied  to  the 
RHS  of  Eq.  (16)  after  each  time-step. 


3.3.2.  Non- Reflecting  Boundary  Conditions 

In  an  operational  NWP  model,  the  four  lateral  and  the  top  boundary  of 
any  mesoscale  model  should  mimic  an  open  domain.  That  is,  waves  should 
smoothly  exit  the  domain  without  reflection;  in  addition,  information  from 
outside  the  domain  should  be  allowed  to  enter  the  domain  of  interest.  Math¬ 
ematically  modeling  this  behavior  is  non-trivial  and  has  attracted  the  atten¬ 
tion  of  researchers  in  many  domains  [3,  17,  27].  In  our  model,  we  utilize 
a  simple,  albeit,  effective  absorbing  sponge  layer  method.  The  computa¬ 
tional  domain  is  surrounded  by  a  layer  with  Newtonian  relaxation  coefficients 
a(x,  y,  z )  and  /3(x,  y,  z )  such  that  a  =  1  and  0  —  0  in  the  domain  of  interest, 
while  a  — >  0  and  (3  — »  1  at  the  boundary.  Specifically,  for  the  top  boundary, 
we  choose 

/  \  4 


P 


z  —  zq 


Zt  -  Z s 


(22) 
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where  zt  is  the  vertical  height  of  the  domain  and  is  the  bottom  of  the  sponge 
layer.  Similar  functions  are  used  for  the  lateral  boundaries  of  the  domain. 
Once  the  sponge  layer  is  constructed,  the  numerical  solution  q  given  by  the 
RHS  of  Eq.  (16)  is  relaxed  to  some  known  solution  at  the  boundary  q;,  via 

q  =  a(x,y,z)q  + f3(x,y,z)qb.  (23) 

For  problems  under  consideration  in  this  paper,  q^  is  a  far-held  condition 
(e.g.  known  wind  velocity). 

4.  Parallel  Implementation 

NUMA  is  an  MPI-based  code  targeted  toward  distributed  memory  ar¬ 
chitectures  (e.g.  clusters).  In  this  section,  we  discuss  the  parallel  imple¬ 
mentation  of  NUMA,  including  a  description  of  the  domain-decomposition, 
necessary  data  structures  (e.g.  local  to  global  mappings),  and  communication 
algorithms. 

4-1.  Domain  Decomposition 

We  decompose  hi  into  Np  processor  elements  (PE)  Llp  that  consist  of  local 
elements  .  Mathematically,  we  rewrite  Eq.  (7)  as 

Np  wfp) 

n  =  U  U  (24) 

p= 1  e'=l 

where  NeP '  is  the  number  of  local  elements  residing  on  PE  p.  Since  the  DSS 
operator  acts  on  global  elements  e,  we  must  also  construct  local  to  global 
mappings  e  =  LG ^  (e')  that  map  local  elements  e'  on  processor  p  to  global 
elements  e  residing  on  the  global  domain  Q. 

A  guiding  principle  in  the  construction  of  NUMA  is  to  maintain  indepen¬ 
dence  between  grid  generation  and  the  spectral  element  solver;  hence,  domain 
decomposition  should  be  as  general  as  possible  and  not  constrained  by  the 
underlying  grid  connectivity.  Therefore,  we  have  implemented  a  domain  de¬ 
composition  strategy  based  on  the  widely  used  METIS  graph  partitioning 
library  [22],  METIS  requires  an  adjacency  graph  where  the  Ne  vertices  are 
elements  fle  and  the  “edges”  denote  the  connectivity  between  elements.  Since 
the  DSS  operator  requires  information  from  elements  that  share  nodes,  the 
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“edges”  include  all  forms  of  geometric  connectivity  (faces,  edges,  and  ver¬ 
tices).  For  maximum  flexibility,  we  constructed  a  weighted  adjacency  graph 
G'  =  (V,  E)  with  adjacency  matrix  A'  of  size  Ne  by  Ne  defined  via 

{1  if  i  and  j  are  vertex  neighbors 
2  if  i  and  j  are  edge  neighbors  (25) 

4  if  i  and  j  are  faces  neighbors 

The  weights  in  Eq.  (25)  represent  the  number  of  common  points  between 
neighbors  assuming  linear  (TV  =  1)  elements;  these  weights  are  arbitrary  and 
may  be  altered  to  construct  a  machine-optimal  weighted  adjacency  matrix. 
Once  A1  is  constructed,  the  adjacency  matrix  A  for  the  graph  G  =  ( V,F ), 
where  F  are  geometrical  faces  is  simply  =  1  if  ah  =  4  and  al}  =  0 
otherwise.  An  example  connectivity  graph  for  a  2D  grid  is  shown  in  Figure  1, 
showing  both  edge  and  vertex  connectivity.  The  associated  9  by  9  adjacency 
matrix  has  nodes  with  degree  ranging  from  3  (for  the  corner  nodes)  to  8  (for 
the  central  node).  This  standard  adjacency  matrix  A  may  be  utilized  within 
a  discontinuous  Galerkin  (DG)  framework  [11],  where  the  only  inter-element 
communication  is  between  adjacent  faces  via  flux  operators. 

Since  we  are  considering  a  local  method,  both  A  and  A'  are  sparse.  There¬ 
fore,  these  matrices  are  stored  in  compressed  storage  (GSR)  format  using  an 
adjacency  list  adjncy  of  length  2\E\  and  array  xadj  of  length  \V\  +  1.  These 
arrays,  along  with  the  number  of  processor  Np  are  then  passed  to  METIS, 
which  returns  a  partition  V  :  V  — >  {1,2,  ...,NP},  that  maps  global  elements 
to  processors.  This  mapping  is  then  used  to  construct  the  local  to  global 
mappings  LG  necessary  for  global  assembly  in  Eq.  (14). 

In  addition  to  the  element  adjacency  graph  G,  a  processor-element  adja¬ 
cency  graph  Gp  =  ( Vp,Ep )  is  constructed,  where  the  vertices  Vp  are  pro¬ 
cessor  elements  and  the  edges  Ep  are  the  connectivity  between  processor 
elements.  Again,  since  we  are  considering  a  CG  method,  Ep  includes  vertex, 
edge,  and  face  connections  between  processor  elements.  The  adjacency  ma¬ 
trix  Np  by  Np  adjacency  matrix  A ^  associated  with  Gp  is  derived  from  A! 
as  follows:  the  element  a-f'1  =  1  if  the  intersection  of  all  rows  i'  and  columns 
j'  of  A'  such  that  i  =  V(i')  and  j  =  V(j')  has  at  least  one  nonzero  element; 
otherwise,  a[P  =  0.  From  the  inter-processor  adjacency  matrix  A^p\  the 
necessary  communication  data  structures  are  constructed  with  ease.  Specif¬ 
ically,  the  neighbors  of  processor  element  i  are  simply  the  non- zero  columns 
of  row  i,  while  the  number  of  neighbors  for  element  i  is  given  by  Ylf=i  ap  • 
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Figure  1:  Example  2D  grid  (left)  and  the  associated  adjacency  graph.  Since  spectral 
element  methods  utilize  nodal  communication,  the  adjacency  graph  includes  both  edge 
and  vertex  connectivity,  with  a  maximum  degree  of  eight.  For  3D,  structured,  Cartesian 
grids,  the  maximum  degree  is  26. 

Thus,  a  low-communication  partition  will  have  an  adjacency  matrix 
that  is  as  sparse  as  possible. 

4-2.  Parallelization:  Explicit  TI 

We  first  consider  the  parallelization  of  the  explicit  time-integrator,  due  to 
its  simplicity.  The  global  DSS  operator  in  Eq.  (14)  requires  inter-processor 
communication  due  to  the  overlap  of  elements  at  processor  element  bound¬ 
aries.  A  global  DSS  is  required  in  two  parts  of  the  code:  1)  construction 
of  the  mass  matrix  and  2)  construction  of  the  right-hand  side  (RHS).  To 
perform  this  communication,  we  first  construct  the  boundary  nodes  of  each 
processor  (excluding  the  physical  boundary  where  BCs  are  applied).  Denote 
the  boundary  of  processor  element  i  by  dfli.  In  order  to  communicate  be¬ 
tween  processor  elements,  we  construct  two  data  structures  send  and  recv. 
Consider  a  particular  processor  i  and  neighboring  processor  j .  send  contains 
the  local  nodes  on  processor  i  that  are  sent  to  each  neighboring  processor  j, 
while  recv  contains  the  local  nodes  on  processor  j  that  must  be  sent  to  i.  In 
order  to  construct  send,  we  utilize  the  method  shown  in  Algorithm  1.  The 
corresponding  recv  structure  contains  the  same  grid  points  as  send,  although 
they  may  be  ordered  differently. 

Once  these  data  structures  have  been  constructed,  the  global  mass  matrix 


13 


Algorithm  1  Construction  of  MPI  send/receive  communication  data  struc¬ 
tures. _ 

for  all  NBHs  j  of  i  do 
MPISEND  dnf  <-  LG®  (dnj  to  proc  j 
MPIRECV  dflf  <-  LG®  (d%)  to  proc  i 

b  <—  onf  n  dnf 

send(j)  <- 

end  for 


and  RHS  operators  may  be  constructed  via  a  two  step  process.  The  global 
mass  matrix  M/j  and  RHS  operator  in  Eq.  (14)  is  decomposed  as 


Ne 

NP 

m,j  =  f\  mV  = 

A  A  MiP and 

(26a) 

e=l 

np= 1  e'=l 

Ne 

Np  N(ep) 

Ri  =  A  A’ 

=  A  A 

(26b) 

e—1 

np= 1  e'=l 

Hence,  the  global  DSS  is  decomposed  into  a  local,  on-processor  DSS  and  a 
global  DSS,  that  requires  inter-processor  communication.  In  this  way,  global 
continuity  between  elements  is  preserved  during  the  construction  of  each 
RHS.  Note  that  the  communication  stencil  is  simply  the  boundary  of  each 
processor  element  dn®  and  is  independent  of  the  polynomial  order  N ;  that 
is,  unlike  higher-order  finite  difference  or  finite  volume  methods,  the  spectral- 
clement  method  is  halo-free.  In  summary,  the  global  DSS  procedure  may  be 
summarized  as  follows: 

1.  Perform  a  local  DSS  on  processor  i. 

2.  Exchange  boundary  points  between  processors  i  and  all  processor  neigh¬ 
bors  j  using  send  and  recv. 

3.  Perform  a  global  DSS  using  the  boundary  data  received  from  neighbors 

j- 

The  global  DSS  operation  is  represented  schematically  in  Figure  2  in  a  2D 
setting.  To  simplify  the  discussion,  each  processor  is  assumed  to  own  one  cl¬ 
ement.  In  order  to  construct  the  RHS  operator  on  the  boundary  (red  dots), 
the  element  E  (green)  requires  nodal  information  for  its  8  nodal  neighbors. 
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These  neighbors  include  both  edge  neighbors  (2,  4,  6,  and  8)  and  vertex 
neighbors  (1,  3,  5,  and  7).  In  a  3D  setting,  an  element  may  have  (a  max¬ 
imum)  of  6  face  neighbors,  12  edge  neighbors,  and  8  vertex  neighbors,  for 
a  total  of  26  nodal  neighbors.  To  reduce  this  communication  cost,  METIS 
is  utilized  to  reduce  the  total  number  of  vertex  neighbors;  in  a  typical  3D 
partition,  a  processor  element  lying  away  from  a  physical  boundary  has  16  or 
less  nodal  neighbors;  hence,  the  METIS-based  decomposition  further  reduces 
communication  cost  relative  to  a  naive  geometric  domain  decomposition. 

Spectral  elements,  and  in  fact  all  Galerkin  based  methods,  possess  purely 
local  communication  stencils.  Referring  to  Fig.  2,  the  interior  nodes  (yellow 
dots)  do  not  need  to  be  communicated  to  adjacent  processors,  resulting  in  a 
halo-free  algorithm.  Thus  unlike  high-order  finite  difference  and  finite  volume 
schemes,  spectral  elements  may  achieve  spectral  accuracy  without  sacrific¬ 
ing  their  local  character.  We  note  in  passing  that  discontinuous  Galerkin 
(DG)  methods  further  reduce  communication  costs,  since  elements  need  only 
communicate  with  face  neighbors;  we  will  report  the  results  of  our  DG  im¬ 
plementation  in  a  future  paper. 

4-3.  Parallelization:  Semi- Implicit  TI 

Unlike  the  explicit  RK-35  TI,  the  semi-implicit  TI  requires  the  solution 
of  the  linear  system  given  by  Eq.  (18)  at  each  time-step.  First,  the  effective 
source  term  Rtt  must  be  constructed,  which  requires  applying  the  divergence 
operator  DT.  To  solve  the  linear  system,  the  Krylov-space  solver  GMRES  is 
utilized,  which  requires  constructing  a  set  of  orthonormal  basis  vectors  Vj  that 
spans  the  solution  space  by  1)  constructing  the  matrix-vector  product  v  = 
7 dvj  at  each  iteration,  and  2)  constructing  the  orthonormal  vector  Vj+i  from 
v  via  orthogonalization  of  Krylov  vectors.  After  the  discrete  pressure  Ptt  is 
constructed,  additional  differential  operators  D  must  be  applied  to  construct 
the  solution  qn+1.  Each  of  these  operations  require  MPI  communication, 
which  is  outlined  in  this  section. 

In  constructing  both  the  LHS  and  RHS  operators  in  Eq.  (18),  the  dif¬ 
ferential  operator  given  by  Eq.  (19)  must  be  applied  to  the  Krylov  vector 
Vj.  Analogous  to  the  construction  of  the  RHS  operator  in  Eq.  (26b),  the 
differential  operator  is  decomposed  into  a  local,  or  on-processor  DSS,  and 
a  global  DSS  that  requires  inter-processor  communication.  The  communi¬ 
cation  stencil  for  the  global  DSS  is  the  same  as  the  explicit  RHS;  that  is, 
the  computation  of  derivatives  is  halo-free  within  the  continuous  Galerkin 
paradigm. 
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Figure  2:  Global  DSS  operator  in  a  2D  setting,  where  each  processor  owns  one  element. 
The  element  E  (green)  requires  nodal  information  for  its  8  nodal  neighbors  in  order  to 
construct  the  RHS  operator  on  the  boundary  (red  dots).  However,  the  interior  nodes 
(yellow  dots)  do  not  need  to  be  communicated  to  adjacent  processors,  resulting  in  a  halo- 
free  algorithm. 
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In  addition  to  differentiation  operators,  the  orthogonalization  procedure 
requires  communication.  In  the  modified  Gram-Schmidt  procedure,  k  global 
dot  products  cq  =  vTVj  must  be  performed  at  the  k-th  Krylov  iteration, 
where  1  <  i  <  k.  These  dot  products  1)  depend  on  the  global  vector,  and  2) 
are  required  by  each  processor  element;  hence,  the  computation  of  dot  prod¬ 
ucts  is  an  all-to-all  communication  that  becomes  prohibitive  for  large  pro¬ 
cessor  counts.  To  mitigate  this  communication  overhead,  a  communication- 
avoiding  GMRES  solver  [4]  will  be  incorporated  in  our  next-generation  semi- 
implicit  solver. 

5.  Results 

5.1.  Test  Cases 

Although  a  standard  set  of  2D  mesoscale  test  cases  has  been  proposed 
[31]  and  later  utilized  within  an  element-based  Galerkin  framework  [12],  an 
analogous  suite  of  3D  test  cases  has  not  yet  been  developed.  Fortunately, 
a  3D  dynamical  core  can  be  run  in  2D  mode  by  imposing  symmetry  in  the 
y-dimension  and  periodic  boundary  conditions  in  the  //-lateral  boundaries. 
For  initial  verification  of  NUMA,  we  utilized  the  test  cases  proposed  in  [31]; 
later,  we  ran  full  3D  test  cases  with  no  //-symmetry.  In  the  following  analysis, 
we  consider  four  test  cases:  1)  2D  inertia-gravity  waves,  2)  a  3D  linear  hy¬ 
drostatic  mountain,  3)  a  3D  nonlinear  mountain  and  4)  a  3D  rising  thermal 
bubble. 

5.1.1.  2D  Inertia- Gravity  Waves 

The  2D  nonhydrostatic  inertia-gravity  wave  simulates  the  evolution  of 
a  potential  temperature  perturbation  in  a  channel  with  periodic  boundary 
conditions;  this  test  case  was  originally  proposed  in  [32],  To  adapt  this 
problem  to  3D,  symmetry  in  the  (/-dimension  was  enforced  along  with  pe¬ 
riodic  boundary  conditions  in  the  //-dimension.  The  domain  is  defined  as 
(T,  //,  z )  G  [0,  300000]  x  [0,  300000]  x  [0, 10000]  m.  No-flux  boundary  con¬ 
ditions  are  applied  at  the  top  and  bottom  of  the  domain,  while  periodic 
boundary  conditions  are  applied  on  the  lateral  boundaries. 

5.1.2.  3D  Linear  Hydrostatic  Mountain 

To  test  the  3D  capabilities  of  NUMA  and  the  NRBCs,  we  consider  strat¬ 
ified  flow  past  an  isolated  mountain  as  outlined  in  [33] .  An  initial  horizontal 
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flow  U  —  20  m/s  blows  past  a  mountain  with  orography  given  by 


h(x,y)  =  k°  9  ,  (27) 

{x2/a2  +  y2  /  a2  +  lR/2 

with  mountain  half-width  a  =  10  km  and  height  ho  =  1  m.  A  profile  of  the 
linear  hydrostatic  mountain  is  shown  in  Figure  3,  where  the  2-axis  has  been 
amplified  for  illustrative  purposes.  The  hydrostatic  background  is  specified 
by  a  constant  Brunt- Vaisala  frequency  Nfn,  =  gf  \J cpT0  with  ground  temper¬ 
ature  To  =  250  K.  In  other  words,  we  consider  an  isothermal  atmosphere.  No 
flux  boundary  conditions  are  imposed  on  the  bottom  of  the  domain,  while 
NRBCs  are  imposed  on  the  four  lateral  boundaries  and  the  top  boundary.  In 
order  to  verify  our  numerical  results,  several  analytical  results  from  [33]  and 
[34]  are  utilized.  First,  a  contour  integral  solution  for  the  density  perturba¬ 
tions  p'  valid  under  a  linear  Boussinesq  approximation  is  utilized.  Also,  the 
velocity  perturbations  parallel  and  perpendicular  to  the  flow  ( u '  and  v')  are 
known  for  observation  points  near  the  ground  ( z  =  0)  (see  Eqs.  (39)  and 
(41)  in  [33]): 


u(x,y,0  )  =  ^1+xVj  +  !/2/a2 

(28a) 

v  (x.  y.  °)  =  hNbv  J  ^  a.2/a2  y2/a2 

(28b) 

under  the  same  approximation. 

5.1.3.  3D  Nonlinear  Mountain 

Additional  tests  were  performed  by  increasing  the  height  of  the  mountain 
in  Eq.  (27)  from  ho  =  1  m  to  ho  =  1000  nr,  causing  nonlinear  effects  such  as 
flow  splitting.  All  other  parameters  were  held  constant. 


5.1. 4-  3D  Rising  Thermal  Bubble 

We  consider  a  3D  buoyant  thermal  bubble  rising  in  a  neutrally  stratified 
atmosphere  [37],  which  is  the  3D  extension  of  a  2D  thermal  bubble  originally 
considered  in  [35].  The  hydrostatic  potential  temperature  90(z )  =  300  K 
(neutral  atmosphere)  is  perturbed  with  a  sphere  of  radius  rc  =  250  m  centered 
at  ( xc ,  yc,  zc )  =  (500,  500,  260)  m  by  a  cosine  taper  given  by 


O' 


A 


1  +  cos 


(29) 
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Figure  3:  Profile  of  the  isolated,  3D  linear  hydrostatic  mountain  defined  by  Eq.  (27) 
originally  considered  in  [33] 
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where  r  =  \J{x  —  xc )2  +  (y  —  pc)2  +  (z  —  zc)2  and  A  is  a  constant.  Unlike 
the  test  case  utilized  in  [37],  our  problem  has  a  C1  initial  condition,  thus 
mitigating  unphysical  oscillations  associated  with  spectral  element  methods. 
The  domain  is  defined  as  {x,  y ,  z)  G  [0, 1000]  x  [0, 1000]  x  [0, 1500]  m.  No-flux 
boundary  conditions  are  applied  on  all  six  boundaries. 

5.2.  Numerical  Verification 

The  numerical  verification  proceeds  in  three  steps.  In  phase  one,  we  ran 
the  code  in  pseudo-2D  mode  using  1  element  and  periodic  boundary  condi¬ 
tions  in  the  ^/-direction.  The  numerical  results  for  the  standard  mesoscale 
suite  [31]  are  directly  compared  to  the  results  of  our  existing  2D  model.  In 
phase  two,  we  considered  the  linear  isolated  mountain  problem,  which  pos¬ 
sess  an  approximate  analytical  solution  in  the  form  of  a  contour  integral.  In 
phase  three,  we  consider  flows  over  nonlinear  mountains  with  height  of  1  km 
and  three-dimensional  buoyant  convection.  Although  these  problems  do  not 
have  analytical  solutions,  their  qualitative  behavior  is  understood. 

Figure  4  displays  results  from  NUMA  for  the  2D  Inertia-Gravity  Waves 
test  case.  Fig.  4(a)  shows  the  potential  temperature  perturbation  O'  after 
2500  s  for  250  m  resolution  (120  by  1  by  4  elements)  and  10-th  order  polyno¬ 
mials  for  a  slice  in  the  center  of  the  domain.  Fig.  4(b)  shows  profiles  along 
z  =  5000  m;  note  the  symmetry  about  x  =  150000  m.  These  3D  results 
agree  to  eight  decimal  places  of  precision  with  earlier  results  generated  by  a 
2D  spectral  element  model.  From  this  initial  test,  we  conclude  that  NUMA 
is  capable  of  1)  capturing  the  propagation  of  gravity  waves  in  a  stratified 
medium  and  2)  mimicking  the  results  of  a  purely  2D  model. 

In  order  to  test  the  3D  operators,  orography,  and  non-reflecting  boundary 
conditions,  flow  over  a  3D  isolated  linear  hydrostatic  mountain  was  consid¬ 
ered.  This  problem  may  be  solved  under  the  linear  Bosusinesq  approximation 
via  a  contour  integral  technique  as  described  in  [34],  In  addition,  closed  form 
expressions  for  both  the  down-stream  and  cross-stream  velocity  components 
may  be  compared  to  the  numerical  solution.  We  used  20  spectral  elements 
in  the  x  and  y  dimensions  and  10  in  the  z  direction  with  eighth-order  poly¬ 
nomials  yielding  effective  resolutions  of  Ax  =  Ay  =1.5  km  and  Ax;  =  300 
m.  The  semi-implicit  TI  was  run  with  a  time-step  of  At  =  1  s  on  an  Apple 
XServe  cluster  using  10  processors. 

Figure  6  compare  the  density  perturbations  p’  using  both  of  these  ap¬ 
proaches.  In  panel  a),  contours  for  p’  are  shown  after  6  hours  of  simulation 
time  and  compared  to  the  contour  integral  solution.  Agreement  is  very  good 
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Figure  4:  Case  1:  2D  Inertia-gravity  waves.  Potential  temperature  perturbation  after  2500 
s  for  250  m  resolution  (120  by  1  by  4  elements)  and  10-th  order  polynomials,  a)  shows  a 
slice  in  the  center  of  the  domain  and  b)  shows  profiles  along  z  =  5000  m. 


near  the  mountain,  while  the  two  solutions  begin  to  deviate  as  z  increases 
due  to  the  influence  of  the  sponge.  Agreement  between  NUMA  and  Smith’s 
results  may  be  brought  into  closer  agreement  by  either  1)  increasing  the  res¬ 
olution  of  the  model  or  2)  running  the  model  longer.  Additional  verification 
was  performed  by  comparing  the  velocity  on  the  surface  of  the  mountain. 
The  down-stream  velocity  perturbation  u'  and  cross-stream  velocity  pertur¬ 
bation  v'  at  the  ground  are  compared  with  the  analytical  formulas  in  Eq. 
(28).  Figure  5  displays  the  results  of  this  comparison  after  t  —  4  hours  of 
integration.  Agreement  between  the  numerical  spectral  element  model  and 
the  analytical  formulas  given  by  Eq.  (28)  is  satisfactory,  especially  for  the 
cross-stream  velocity  perturbation. 

After  verification  in  an  isothermal  atmosphere,  the  hydrostatic  isolated 
mountain  problem  was  also  tested  in  a  neutral  atmosphere  with  9q(z)  =  250 
K.  To  our  knowledge,  there  is  no  analytical  solution  for  this  problem,  but 
the  results  are  qualitatively  similar  to  the  wave  patterns  generated  for  a 
2D  linear  hydrostatic  ridge.  The  three  velocity  components  are  shown  in 
Figure  7  after  20  and  60  minutes.  The  initially  horizontal  flow  creates  a 
discontinuity  and  triggers  vertically  propagating  acoustic  waves.  Note  that 
the  cross-stream  velocity  is  symmetric  with  respect  to  the  y-axis,  that  is 
expected  since  the  initial  cross-stream  velocity  is  zero.  As  time  evolves,  a 
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steady  state  mountain  wave  pattern  develops,  which  begins  to  converge  after 
3600  seconds  of  simulation  time.  These  tests  show  the  development  of  eddies, 
especially  on  the  lee  side  of  the  mountain.  Unlike  flows  in  atmospheres  with 
constant  Nbv,  the  velocity  components  do  not  attenuate  as  z  increases.  Since 
the  mountain  has  a  small  height  (1  m),  there  is  no  splitting  of  the  flow  around 
the  base  of  the  mountain.  Qualitatively,  these  3D  mountain  results  are  very 
similar  to  the  standard  2D  mountain  (or  ridge). 

Additional  tests  on  a  nonhydrostatic  mountain  ho  =  1000  m  were  also  per¬ 
formed  using  the  same  initial  horizontal  velocity.  By  increasing  the  height 
of  the  mountain  from  ho  =  1  nr  to  ho  =  1000  nr,  nonlinear  phenomena,  such 
as  flow  splitting  emerge.  Figure  8  shows  the  flow  (i.e.  velocity  vectors)  pro¬ 
duced  by  the  nonlinear  mountain.  Unlike  the  linear  hydrostatic  mountain 
analyzed  above,  this  problem  cannot  be  solved  by  analytical  means.  In  this 
test  case,  the  velocity  perturbations  are  much  larger  than  the  1  nr  mountain 
and  significant  eddy  mixing  develops  on  the  lee  side  of  the  mountain.  How¬ 
ever,  since  the  maximum  velocity  is  an  order  of  magnitude  smaller  than  the 
speed  of  sound,  a  relatively  large  tinre-step  may  be  utilized  within  a  senri- 
implicit  solver,  demonstrating  the  efficiency  of  the  Schur  SI- integrator.  This 
test  confirms  that  NUMA  is  capable  of  handling  topography;  in  future  tests, 
we  will  consider  flow  over  realistic  orography.  To  compare  the  velocity  fields 
of  the  nonlinear  mountain  with  the  linear  mountain,  the  horizontal  (u)  and 
vertical  (w)  velocity  perturbations  in  the  plane  y  =  0  are  shown  in  Figure  9. 
Note  that  the  magnitude  of  the  vertical  velocities  between  the  LHM  and  the 
nonlinear  mountain  differ  by  over  three  orders  of  magnitude. 

Finally,  the  results  of  the  buoyant  convection  (i.e.,  rising  3D  thermal  bub¬ 
ble)  experiment  are  shown  in  Figure  10.  A  smooth  potential  temperature 
perturbation  in  an  initially  neutral  atmosphere  generates  vertical  updrafts 
and  shear  velocities,  which  cause  the  bubble  to  rise,  deform,  and  transition 
to  turbulence.  The  rising  thermal  bubble  problem  tests  the  model’s  ability 
to  capture  1)  the  effects  of  turbulence  and  2)  turbulent  convection.  Although 
no  analytical  solution  exists  for  this  problem,  the  numerical  results  are  phys¬ 
ically  plausible  and  resemble  previous  3D  bubble  experiments  in  [37]  and  [1]. 
Similar  results  are  seen  for  both  the  explicit  and  semi-implicit  codes. 

5.3.  Comparison  of  Time- Integrators 

An  extensive  comparison  of  high-order  RK35  time-integrators  and  the 
semi-implicit  BDF2  integrator  was  performed  in  [14]  for  2D  nonhydrostatic 
problems.  In  this  section,  we  briefly  extend  this  analysis  to  3D,  where  we 
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x  10“3 


(a)  down-stream  velocity 


x  10“3 


(b)  cross-stream  velocity 


Figure  5:  Comparison  of  the  a)  down-stream  velocity  perturbation  v!  and  b)  cross-stream 
velocity  perturbation  v'  for  the  isolated  mountain  at  ground  level.  Agreement  between 
the  numerical  spectral  element  model  and  the  analytical  formulas  given  by  Eq.  (28)  is 
satisfactory,  especially  for  the  cross-stream  velocity  perturbation. 
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Figure  6:  Comparison  of  the  density  perturbations  p  for  the  isolated  linear  hydrostatic 
mountain  using  1)  NUMA  and  2)  a  contour  integral  solution  [34].  In  panel  a),  the  p ’ 
contours  in  the  plane  y  =  0  are  shown.  In  panel  b),  p’  is  shown  as  a  function  of  z  using 
x  =  y  =  0. 


show  that  the  semi-implicit  Schur  integrator  is  more  efficient  than  the  RK35 
integrator  in  serial  mode  for  all  applicable  Conrant  numbers.  Figure  11 
displays  CPU  times  versus  Courant  numbers  for  both  the  a)  linear  hydrostatic 
mountain  and  b)  the  3D  rising  thermal  bubble  using  the  explicit  RK35  time- 
integrator  and  the  semi- implicit  BDF2  integrator.  Examining  panel  a),  we  see 
the  BDF2  integrator  continues  to  gain  efficiency  until  we  reach  the  maximum 
Courant  number  of  4.88;  for  Courant  numbers  larger  than  4.88,  the  semi- 
implicit  integrator  becomes  unstable  due  to  the  explicit  discretization  of  the 
non-linear  advective  terms.  Comparing  the  CPU  times  of  the  RK35  and 
BDF2  integrator  at  their  maximum  Courant  number,  we  see  that  the  semi- 
implicit  integrator  is  faster  by  a  factor  of  2.6.  A  similar  trend  holds  for  the 
3D  rising  thermal  bubble  problem;  however,  since  the  velocities  are  much 
smaller  for  the  bubble  problem,  the  semi- implicit  integrator  may  be  run  with 
a  much  larger  time-step.  Note  that  the  semi-implicit  integrator  begins  to  lose 
efficiency  when  the  Courant  number  is  increased  from  4.02  to  8.04;  in  this 
case,  the  number  of  required  GMRES  iterations  increases  from  15  to  30.  We 
attribute  this  behavior  to  the  GMRES  solver,  which  scales  quadratically  with 
respect  to  the  number  of  Krylov  iterations.  Hence,  for  both  test  problems, 
the  optimal  Courant  numbers  is  about  4. 
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Figure  7:  Down-stream,  cross-stream,  and  vertical  velocities  for  the  3D  linear  hydrostatic 
mountain  problem  in  a  neutral  atmosphere. 


Figure  8:  Streamlines  and  velocity  vectors  produced  by  flow  about  a  nonlinear  mountain 
with  h  =  1000  m. 
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Figure  9:  Horizontal  and  vertical  velocities  in 
mountain  with  h  =  1000  m. 
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Figure  10:  Evolution  of  a  3D  rising  therma?bubble  problem  (x  —  2-slices  of  the  potential 
temperature  perturbation  6')  in  the  y  =  500  m  plane  for  t  =  50,  100,  150,  200,  250,  and 
300  seconds. 


Figure  11:  Comparison  of  the  explicit  RK35  and  semi-implicit  BDF2  time-integrators  in 
serial  mode  for  a)  the  linear  hydrostatic  mountain  problem  and  2)  the  3D  rising  thermal 
bubble  problem.  In  both  cases  and  for  all  applicable  Courant  numbers,  the  BDF2-SI 
integrator  is  more  efficient. 

5-4-  Parallel  Performance 

Optimized  explicit  and  semi-implicit  versions  of  NUMA  have  been  de¬ 
ployed  on  TACC’s  Ranger  Sun  Constellation  cluster.  Ranger  consists  of  3,936 
16-way  SMP  compute  nodes  (blades)  with  4  AMD  quad-core  Opteron  proces¬ 
sors  per  blade  for  a  total  of  62,976  compute  cores  and  a  theoretical  peak  per¬ 
formance  of  579  Tera  FLOPS.  Two  test  problems  utilizing  the  rising  thermal 
bubble  with  fourth  order  polynomials  were  executed:  a  483  =  110592  element 
problem  for  processor  counts  ranging  from  16  to  512  and  a  643  =  262144  el¬ 
ement  problem  for  processor  counts  ranging  from  512  to  12288.  The  wall 
clock  time  associated  with  time-integration  was  then  recorded  for  each  run; 
the  startup  times  required  for  constructing  grids  and  data  structures  were 
omitted,  as  well  as  the  time  required  for  model  output.  A  time-step  of  0.001 
s  was  utilized  for  the  explicit  code,  whereas  a  time  step  of  0.01  s  was  used 
for  the  semi-implicit  code. 

Figure  12  displays  CPU  times  associated  with  these  scaling  problems. 
In  panel  a),  note  the  nearly  linear  scaling  of  NUMA  for  processor  counts 
ranging  from  32  to  512.  In  panel  b),  note  the  near-linear  scaling  of  the 
MPI  code  for  processor  counts  ranging  from  512  to  4096  (for  semi-implicit) 
and  12288  (for  explicit);  this  linear  scaling  is  attributed  to  1)  the  minimal 
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communication  costs  associated  with  spectral  elements,  and  2)  the  near- 
optimal  implementation  of  the  communication  algorithm.  Secondly,  note  the 
super-linear  speedup  for  processor  counts  ranging  from  1024  to  2048.  This 
super-linear  speedup  is  observed  once  the  local  problem  size  fits  inside  the 
L2  cache  memory. 

Comparing  the  computation  times  of  the  explicit  and  semi-implicit  codes 
in  Fig.  12,  we  see  the  semi-implicit  code  is  more  than  an  order  of  magnitude 
faster  than  the  explicit  code  for  the  time-steps  chosen.  This  speed-up  is 
attributed  to  two  factors:  1)  the  SI  time-integrator  admits  a  larger  time-step 
and  2)  the  BDF2-based  discretization  required  only  2  GMRES  iterations 
per  time-step,  whereas  the  RK35  integrator  requires  constructing  5  RHS’s. 
In  addition,  each  of  the  iterations  associated  with  the  Schur-complement  SI 
has  O  ( Np )  operations,  while  the  explicit  time-integrator  requires  O  (5 Np) 
operations  to  construct  a  RHS  where  Np  is  the  number  of  global  grid  points. 
Hence,  we  conclude  the  SI  integrator  is  more  efficient,  for  both  single  and 
multiple  processors,  relative  to  the  explicit  time-integrator. 

We  note  that  the  scaling  performance  of  the  semi-implicit  code  depends 
on  the  number  of  GMRES  iterations  required  for  convergence.  For  a  time 
step  of  0.01  s,  an  average  of  2  GMRES  iterations  per  time  step  are  required 
for  convergence.  Increasing  the  time  step  increases  the  number  of  required 
GMRES  iterations,  which  reduces  the  scalability  of  the  semi-implicit  code  to 
large  processor  counts.  Hence,  increasing  the  time-step  may  make  the  semi- 
implicit  algorithm  more  efficient  for  small  processor  counts  but  less  efficient 
for  higher  processor  counts.  The  optimal  time-step  for  our  semi-implicit  code 
is  hence  a  function  of  both  the  test  problem  at  hand  (which  determines  the 
number  of  GMRES  iterations)  and  the  available  architecture  (e.g.  number  of 
processors).  The  choice  of  this  optimal  time-step  will  be  studied  in  a  future 
work. 

6.  Discussion  and  Conclusion 

6.1.  Future  Work 

6.1.1.  Microphysical  Parameterizations 

In  conjunction  with  NRL  Monterey,  we  are  incorporating  microphysical 
parameterizations  into  NUMA.  Preliminary  experiments  using  the  Kessler 
scheme  [23]  have  been  conducted  within  a  2D,  serial  implementation  [8]. 
Since  physical  parameterizations  operate  on  columns  of  data  independently 
of  adjacent  data,  the  problem  is  embarrassingly  parallel  provided  the  domain 
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Figure  12:  Optimized,  explicit  RK35  and  semi-implicit  BDF2  METIS-based  spectral  ele¬ 
ment  code,  displaying  CPU  times  versus  processor  counts  ranging  from  a)  16  to  512  and  b) 
512  to  12288.  Near  linear  scaling  to  12288  processors  on  TACCs  Ranger  Sun  Constellation 
cluster  with  super-linear  speedup  observed  once  the  local  problem  size  fits  inside  the  L2 
cache  memory. 

is  decomposed  in  the  horizontal  only  such  that  all  z  values  reside  on  processor. 
To  facilitate  scaling  on  hybrid  shared-distributed  memory  architectures  (such 
as  TACC’s  Ranger),  hierarchical  domain  decomposition  is  desirable,  whereby 
MPI  communication  based  on  the  algorithm  developed  in  the  present  paper 
is  utilized  in  the  horizontal  and  either  OpenMP  parallelization,  appropriate 
for  shared  memory,  or  graphical  processor  units  (GPUs)  are  employed  for 
fine-grained  parallelism  in  the  vertical. 

6.1.2.  Global  Model 

As  stated  in  the  Introduction,  NUMA  is  a  unified  nonhydrostatic  dynam¬ 
ical  core  appropriate  for  both  limited-area  and  global  atmospheric  simula¬ 
tions.  In  this  respect,  NUMA  may  be  viewed  as  the  nonhydrostatic  successor 
to  the  Naval  Spectral  Element  Atmospheric  Model  (NSEAM)  [13,  10].  At 
present,  we  have  developed  3D  spherical  grids  based  on  the  cubed  sphere 
and  icosahedral  geometries  and  have  successfully  performed  rising  thermal 
bubble,  acoustic  wave  propagation  and  inertia-gravity  wave  propagation  ex¬ 
periments  on  these  grids.  In  a  companion  paper,  we  will  present  numerical 
results  for  the  standard  dry  dynamical  core  tests  such  as  the  Jablonowski- 
Williamson  baroclinic  instability  [21],  linear  and  non-linear  mountains  [39]  , 
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and  Held-Suarez  [16]  test  cases. 

6.2.  Conclusion 

In  this  paper,  we  have  developed  a  nonhydrostatic  unified  model  of  the  at¬ 
mosphere  (NUMA)  based  on  a  spectral  element  (or  continuous  Galerkin)  dis¬ 
cretization  in  space  and  a  suite  of  explicit  and  semi-implicit  time-integrators. 
This  model  is  suitable  for  simulations  of  both  limited-area  (mesoscale)  and 
global-area  simulations  of  atmospheric  phenomena;  this  paper  has  subjected 
NUMA  to  a  battery  of  limited-area  simulations,  including  inertia-gravity 
waves,  orographic  flow,  and  buoyant  convection  problems.  The  results  of 
these  test  problems  are  in  agreement  with  either  1)  previous  simulations,  2) 
analytical  results,  or  3)  physical  intuition. 

NUMA  is  targeted  towards  distributed  memory  architectures,  such  as  Sun 
constellation  clusters,  and  hence  requires  implementing  both  domain  decom¬ 
position  and  communication  algorithms.  These  parallelization  strategies  are 
based  on  a  general  graph-theoretic  approach  utilizing  METIS  graph  parti¬ 
tioning.  Since  spectral  elements  are  halo-free,  only  the  boundary  of  each 
processor  element  needs  to  be  communicated  during  the  DSS  operation.  De¬ 
noting  the  number  of  local  elements  by  Ne,  the  communication  bandwidth 

is  O  (^Ne^N2^  while  the  amount  of  on- processor  work  is  O  (-/VeA4);  yielding 

a  work-to-connnunication  ratio  of  O  (^Ne^N2^ .  Hence,  the  spectral  ele¬ 
ment  formulation  is  optimal  for  high  orders  of  discretization.  NUMA  has 
been  deployed  on  TACC’s  Ranger  Sun  Constellation  cluster  and  scalabil¬ 
ity  experiments  have  been  conducted.  These  experiments  reveal  near  linear 
scaling  to  12288  processors  for  the  explicit  code  and  4096  processors  for  the 
semi-implicit  code.  We  note  that  the  semi-implicit  algorithm  scalability  is 
adversely  affected  by  increasing  the  time-step  and  the  associated  number  of 
GMRES  iterations.  We  are  currently  working  on  scalable  preconditioners 
that  are  expected  to  improve  scalability  on  the  semi-implicit  method  [20]. 
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